1 Prefetching using Markov predictors
Doug Joseph, Dirk Grunwald 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th annual
international symposium on Computer architecture, volume 25 issue 2
Full text available: pdf(1.68 MB) Additional Information: full citation, abstract, references, citings, index terms

Prefetching is one approach to reducing the latency of memory operations in modern 
computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as
an interface between the on-chip and off-chip cache, and can be added to existing computer
designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions
from the memory subsystem, and then prioritizing the delivery of those references to the
processor. This design results in a pr ...



2 Using thread-level speculation to simplify manual parallelization
Manohar K. Prabhu, Kunle Olukotun 

June 2003 ACM SIGPLAN Notices, Proceedings of the ninth ACM SIGPLAN symposium
on Principles and practice of parallel programming, volume 38 issue 10
Full text available: pdf(440.50 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual 
parallelization and enhances its performance. A number of techniques for manual 
parallelization using TLS are presented and results are provided that indicate the 
performance contribution of each technique on seven SPEC CPU2000 benchmark 
applications. We also provide indications of the programming effort required to parallelize 
each benchmark. TLS parallelization yielded a 110% speedup on our four ... 

Keywords: chip multiprocessor, data speculation, feedback-driven optimization, manual 
parallel programming, multithreading 
1 Efficient load balancing for wide-area divide-and-conquer applications
Rob V. van Nieuwpoort, Thilo Kielmann, Henri E. Bal 

June 2001 ACM SIGPLAN Notices, Proceedings of the eighth ACM SIGPLAN symposium
on Principles and practices of parallel programming, volume 36 issue 7
Full text available: pdf(118.89 KB) Additional Information: full citation, abstract, references, citings, index terms

Divide-and-conquer programs are easily parallelized by letting the programmer annotate 
potential parallelism in the form of spawn and sync constructs. To achieve efficient program 
execution, the generated work load has to be balanced evenly among the available CPUs. 
For single cluster systems, Random Stealing (RS) is known to achieve optimal load 
balancing. However, RS is inefficient when applied to hierarchical wide-area systems where 
m ... 

Keywords: Java, clustered wide-area systems, distributed supercomputing 



2 A new method to make communication latency uniform: distributed routing balancing
D. Franco, I. Garces, E. Luque 

May 1999 Proceedings of the 13th international conference on Supercomputing 

Full text available: pdf(1.24 MB) Additional Information: full citation, references, citings, index terms



Keywords: adaptive routing, distributed routing balancing, hot spot avoidance, 
interconnection networks, random routing, traffic distribution, uniform latency 



3 Processor microarchitecture I: Partitioned first-level cache design for clustered 
microarchitectures 
Paul Racunas, Yale N. Patt 

June 2003 Proceedings of the 17th annual international conference on 
Supercomputing 

Full text available: pdf(191.74 KB) Additional Information: full citation, abstract, references, citings, index terms

The high clock frequencies of modern superscalar processors make the wire delay incurred 
in moving data across the processor chip a significant concern. As frequencies continue to 
increase, it will become more difficult for a centralized first level data cache to supply the
timely data bandwidth required by superscalar processors. This paper presents a complete
solution for the partitioning of the first level of the memory hierarchy. The first level data 
cache is split into several independent pa ... 

Keywords: clustered microarchitecture, partitioned cache 



4 Improved methods for hiding latency in high bandwidth networks (extended abstract)
Matthew Andrews, Tom Leighton, P. Takis Metaxas, Lisa Zhang 

June 1996 Proceedings of the eighth annual ACM symposium on Parallel algorithms 
and architectures 

Full text available: pdf(1.11 MB) Additional Information: full citation, references, citings, index terms



5 Accounting for memory bank contention and delay in high-bandwidth multiprocessors
Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias, Marco Zagha 

July 1995 Proceedings of the seventh annual ACM symposium on Parallel algorithms 
and architectures 

Full text available: pdf(1.26 MB) Additional Information: full citation, references, citings, index terms



6 Traffic-based Load Balance for Scalable Network Emulation
Xin Liu, Andrew A. Chien 

November 2003 Proceedings of the 2003 ACM/IEEE conference on Supercomputing 

Full text available: pdf(191.58 KB) Additional Information: full citation, abstract

Load balance is critical to achieving scalability for large network emulation studies, which 
are of compelling interest for emerging Grid, Peer to Peer, and other distributed applications 
and middleware. Achieving load balance in emulation is difficult because of irregular 
network structure and unpredictable network traffic. We formulate load balance as a graph 
partitioning problem and apply classical graph partitioning algorithms to it. The primary 
challenge in this approach is how to extract u ... 

7 Simulation of a DQDB MAC protocol with movable boundary and bandwidth balancing
mechanisms

S. Popovich, M. Alam, S. Bandyopadhyay 

April 1992 Proceedings of the 25th annual symposium on Simulation 

Full text available: pdf(1.00 MB) Additional Information: full citation, references, index terms



8 GOAL: a load-balanced adaptive routing algorithm for torus networks
Arjun Singh, William J. Dally, Amit K. Gupta, Brian Towles 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2
Full text available: pdf(260.66 KB) Additional Information: full citation, abstract, references, citings

We introduce a load-balanced adaptive routing algorithm for torus networks, GOAL - 
Globally Oblivious Adaptive Locally - that provides high throughput on adversarial traffic 
patterns, matching or exceeding fully randomized routing and exceeding the worst-case 
performance of Chaos [2], RLB [14], and minimal routing [8] by more than 40%. GOAL also 
preserves locality to provide up to 4.6x the throughput of fully randomized routing [19] on 
local traffic. GOAL achieves global load balance by ra ... 
9 Balanced scheduling: instruction scheduling when memory latency is uncertain
Daniel R. Kerns, Susan J. Eggers 

June 1993 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1993 conference
on Programming language design and implementation, volume 28 issue 6
Full text available: pdf(1.24 MB) Additional Information: full citation, abstract, references, citings, index terms

Traditional list schedulers order instructions based on an optimistic estimate of the load 
delay imposed by the implementation. Therefore they cannot respond to variations in load 
latencies (due to cache hits or misses, congestion in the memory interconnect, etc.) and 
cannot easily be applied across different implementations. We have developed an 
alternative algorithm, known as balanced scheduling, that schedules instructions based on
an estimate of the amount of instruction level parallelis ... 

10 Comparing random data allocation and data striping in multimedia servers 
Jose Renato Santos, Richard R. Muntz, Berthier Ribeiro-Neto 

June 2000 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
2000 ACM SIGMETRICS international conference on Measurement and
modeling of computer systems, volume 28 issue 1

Full text available: pdf(1.18 MB) Additional Information: full citation, abstract, references, citings, index terms

We compare performance of a multimedia storage server based on a random data allocation 
layout and block replication with traditional data striping techniques. Data striping 
techniques in multimedia servers are often designed for restricted workloads, e.g. 
sequential access patterns with CBR (constant bit rate) requirements. On the other hand, a 
system based on random data allocation can support virtually any type of multimedia 
application, including VBR (variable bit rate) video or audio, ... 

11 An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches 
Changkyu Kim, Doug Burger, Stephen W. Keckler 

October 2002 Proceedings of the 10th international conference on Architectural
support for programming languages and operating systems, volume 37, 30, 36 issue 10, 5, 5

Full text available: pdf(1.33 MB) Additional Information: full citation, abstract, references, citings

Growing wire delays will force substantive changes in the designs of large caches. 
Traditional cache architectures assume that each level in the cache hierarchy has a single, 
uniform access time. Increases in on-chip communication delays will make the hit time of 
large on-chip caches a function of a line's physical location within the cache. Consequently, 
cache access times will become a continuum of latencies rather than a single discrete 
latency. This non-uniformity can be exploited to provide ... 

12 Rotating combined queueing (RCQ): bandwidth and latency guarantees in low-cost,
high-performance networks

Jae H. Kim, Andrew A. Chien 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd
annual international symposium on Computer architecture, volume 24 issue 2
Full text available: pdf(1.39 MB) Additional Information: full citation, abstract, references, citings, index terms

Network service guarantees not only provide significant performance benefits to distributed 
computing systems (more balanced resource utilization, fast fault recovery, and fair 
network access), but they are also essential for many new applications requiring real-time 
communications with continuous data types (audio/video). Most existing algorithms which 
provide network service guarantees are too complicated to be feasible in high-speed, low- 
cost switches for multicomputer networks. The simpler a ... 
13 Load-balanced location management for cellular mobile systems using quorums and
dynamic hashing 

Ravi Prakash, Zygmunt Haas, Mukesh Singhal 
September 2001 Wireless Networks, Volume 7 issue 5

Full text available: pdf(282.87 KB) Additional Information: full citation, abstract, references, index terms

This paper presents a new distributed location management strategy for cellular mobile 
systems. Its salient features are fast location update and query, load balancing among 
location servers, and scalability. The strategy employs dynamic hashing techniques and 
quorums to manage location update and query operations. The proposed strategy does not 
require a home location register (HLR) to be associated with each mobile node. Location 
updates and queries for a mobile node are multicast to subsets o ... 

Keywords: distributed location management, dynamic hashing, location independent 
numbering, mobile computing, quorum systems, tries 




14 Effects of communication latency, overhead, and bandwidth in a cluster architecture 
Richard P. Martin, Amin M. Vahdat, David E. Culler, Thomas E. Anderson 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture, volume 25 issue 2
Full text available: pdf(2.08 MB) Additional Information: full citation, abstract, references, citings, index terms

This work provides a systematic study of the impact of communication performance on 
parallel applications in a high performance network of workstations. We develop an 
experimental system in which the communication latency, overhead, and bandwidth can be 
independently varied to observe the effects on a wide range of applications. Our results 
indicate that current efforts to improve cluster communication performance to that of
tightly integrated parallel machines result in significantly improved ...

15 Parallel and distributed systems and networking: Analysis of Distributed Routing
Balancing behavior
I. Garces, D. Franco 

March 2002 Proceedings of the 2002 ACM symposium on Applied computing 

Full text available: pdf(613.16 KB) Additional Information: full citation, abstract, references, index terms

Distributed Routing Balancing (DRB) is a method developed to uniformly balance 
communication traffic over the interconnection network. DRB takes a similar approach to 
communications as load balancing does to processes in a distributed environment. The key 
ideas behind DRB are to distribute communication load based on limited and load-controlled 
path expansion, in order to maintain low message latency. In this paper, we present an 
exhaustive evaluation of DRB that shows that the method presents I ... 

Keywords: adaptive routing, distributed routing balancing, hot spot avoidance, 
interconnection networks, traffic distribution 



16 Reducing wire delay penalty through value prediction 
Joan-Manuel Parcerisa, Antonio Gonzalez 

December 2000 Proceedings of the 33rd annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf(148.85 KB), ps(401.62 KB) Additional Information: full citation, references, citings, index terms
17 Diffracting trees

Nir Shavit, Asaph Zemach 

November 1996 ACM Transactions on Computer Systems (TOCS), Volume 14 issue 4 

Full text available: pdf(729.57 KB) Additional Information: full citation, abstract, references, citings, index terms

Shared counters are among the most basic coordination structures in multiprocessor 
computation, with applications ranging from barrier synchronization to concurrent-data-
structure design. This article introduces diffracting trees, novel data structures for shared
counting and load balancing in a distributed/parallel environment. Empirical evidence,
collected on a simulated distributed shared-memory machine and several simulated
message-passing architectures, shows that diffracting trees scal ...

Keywords: contention, counting networks, index distribution, lock free, wait free 



18 Fingerprinting: bounding soft-error detection latency and bandwidth
Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, Andreas G. 
Nowatzyk 

October 2004 Proceedings of the 11th international conference on Architectural
support for programming languages and operating systems, volume 39, 32, 38 issue 11, 5, 5

Full text available: pdf(229.65 KB) Additional Information: full citation, abstract, references, index terms

Recent studies have suggested that the soft-error rate in microprocessor logic will become a 
reliability concern by 2010. This paper proposes an efficient error detection technique, 
called fingerprinting, that detects differences in execution across a dual modular redundant 
(DMR) processor pair. Fingerprinting summarizes a processor's execution history in a hash- 
based signature; differences between two mirrored processors are exposed by comparing 
their fingerprints. Fingerprinting tightly ... 

Keywords: backwards error recovery (BER), dual modular redundancy (DMR), error 
detection, soft errors 



19 Efficient fair queueing using deficit round-robin
M. Shreedhar, George Varghese 

June 1996 IEEE/ACM Transactions on Networking (TON), volume 4 issue 3 

Full text available: pdf(1.57 MB) Additional Information: full citation, references, citings, index terms



20 A two-tier heterogeneous mobile Ad Hoc network architecture and its load-balance 
routing problem 

Chi-Fu Huang, Hung-Wei Lee, Yu-Chee Tseng 

August 2004 Mobile Networks and Applications, Volume 9 Issue 4 

Full text available: pdf(1.28 MB) Additional Information: full citation, abstract, references, index terms

The mobile ad hoc network (MANET) has attracted a lot of interest recently. However, most 
of the existing works have assumed a stand-alone MANET. In this paper, we propose a two- 
tier, heterogeneous MANET architecture which can support Internet access. The low tier of 
the network consists of a set of mobile hosts each equipped with an IEEE 802.11 wireless
LAN card. In order to connect to the Internet and handle the network partitioning problem, 
we propose that the high tier is comprised of a subse ... 